This tutorial provides an example of conducting cell-type deconvolution on oral squamous cell carcinoma (OSCC) spatial transcriptomics data from Arora and Cao et al, 2023 using the ZI-HGT + CARD. The ZI-HGT, or zero-inflated hierarchical generalized transformation model, is a technique to noisily transform spatial transcriptomics data by reducing zero-inflation, ties, and skewness to achieve approximate normality, and it enables uncertainty quantification. The ZI-HGT is combined here with CARD (Ma et al, 2022) to perform cell-type deconvolution.
This tutorial uses the OSCC spatial transcriptomics data from Arora and Cao et al, 2023, for which the processed Seurat objects can be downloaded here (raw and SpaceRanger processed data are deposited in the National Center of Biotechnology Information’s Gene Expression Omnibus (GEO) database under accession code GSE208253). The single-cell deconvolution reference from Puram et al, 2017 may be found in the GEO database under accession code GSE103322.
We provide small subsets of these publicly available datasets used for demonstration purposes only.
Sources:
Arora and Cao et al., 2023 (GEO: GSE208253)
Puram et al., 2017 (GEO: GSE103322)
Please cite the original studies if using these data in any published work.
The spatial transcriptomics data should be in a matrix or
sparseMatrix format, where rows are genes and columns are named
locations. The spatial locations should be in a dataframe
format with two columns, x and y, representing
the spatial coordinates of each location. The row names of the spatial
locations dataframe should match the column names of the spatial
transcriptomics matrix.
spatial_count <- readRDS("Example_Data/example_spatial_counts.rds")
spatial_location <- readRDS("Example_Data/example_spatial_locations.rds")
as.matrix(spatial_count)[1:5,1:5]
## CACTTCGCCACAGGCT-1 TACGTGCACTATGCTG-1 GCTAAACCTGAGGTGA-1
## AL627309.1 0 0 0
## LINC01409 0 0 0
## LINC01128 0 0 0
## FAM41C 0 0 0
## SAMD11 0 0 0
## CGAGCACTTCAAGTTT-1 TGGGAAATGCCTTTCC-1
## AL627309.1 0 0
## LINC01409 0 0
## LINC01128 0 0
## FAM41C 0 0
## SAMD11 0 0
head(spatial_location, 5)
## x y
## CACTTCGCCACAGGCT-1 21 31
## TACGTGCACTATGCTG-1 21 33
## GCTAAACCTGAGGTGA-1 23 23
## CGAGCACTTCAAGTTT-1 23 25
## TGGGAAATGCCTTTCC-1 23 27
The single-cell RNAseq reference gene expression data should be in a
matrix or sparseMatrix format, where rows are genes
and columns are named cells. The metadata should be in a
dataframe format that includes a sample column
(not required to be named sample) and a cell_type column
(not required to be named cell_type), and the rows of the metadata
should match the column names of the scRNAseq gene expression
matrix.
sc_count <- readRDS("Example_Data/example_sc_counts.rds")
sc_meta <- readRDS("Example_Data/example_sc_metadata.rds")
as.matrix(sc_count)[1:5,1:5]
## HN28_P15_D06_S330_comb HN28_P6_G05_S173_comb HN26_P14_D11_S239_comb
## C9orf152 0.0000 0.0000 0.42761
## RPS11 6.0037 7.3006 7.28850
## ELMO2 0.0000 0.0000 0.00000
## CREB3L1 0.0000 0.0000 0.00000
## PNMA1 5.1474 5.3329 2.83370
## HN26_P14_H05_S281_comb HN26_P25_H09_S189_comb
## C9orf152 0.0000 0.00000
## RPS11 0.0000 7.47420
## ELMO2 5.2465 0.50487
## CREB3L1 0.0000 0.00000
## PNMA1 5.7507 0.19661
head(sc_meta, 5)
## sample cell_type
## HN28_P15_D06_S330_comb HN28 myofibroblast
## HN28_P6_G05_S173_comb HN28 myofibroblast
## HN26_P14_D11_S239_comb HN26 cancer cell
## HN26_P14_H05_S281_comb HN26 myofibroblast
## HN26_P25_H09_S189_comb HN26 cancer cell
The ZI-HGT noisily transforms spatial transcriptomics data by reducing zero-inflation, ties, and skewness to achieve approximate normality through a type of zero-inflated Poisson. \(n\) posterior samples of the transformation are generated, and cell-type deconvolution with CARD is applied to each to generate a posterior distribution of the cell-type proportions. We suggest taking the mean of the posterior distribution as the cell-type proportions point estimate, and we provide 95% pointwise Bayesian credible intervals of each cell-type proportion at each location.
To run the ZI-HGT and CARD, source the functions in the HGTfunctions3.R script (available in this directory) and follow the example below.
# Source the needed functions
source("./HGTfunctions3.R")
# Run the ZI-HGT and CARD
ZI_HGT_CARD_obj <- HGTplusCARD(
sc_count = sc_count,
sc_meta = sc_meta,
spatial_count = spatial_count,
spatial_location = spatial_location,
ct.varname = "cell_type",
ct.select = unique(sc_meta$cell_type),
sample.varname = "sample",
minCountGene = 100,
minCountSpot = 5,
alpha0 = 2,
alpha1 = 0.5,
n = 5,
seed = 123456,
doWAIC = FALSE)
The arguments to the HGTplusCARD function are as follows:
sc_count: a matrix or sparseMatrix of single-cell
RNAseq gene expression data, where rows are genes and columns are named
cells.
sc_meta: a dataframe of single-cell RNAseq metadata,
where the rows match the column names of sc_count and
include a sample column and a cell_type
column.
spatial_count: a matrix or sparseMatrix of spatial
transcriptomics gene expression data, where rows are genes and columns
are named locations.
spatial_location: a dataframe of spatial locations,
where the rows match the column names of spatial_count and
include x and y columns representing the
spatial coordinates.
ct.varname: the name of the column in
sc_meta that contains the cell-type labels.
ct.select: a vector of cell-type labels to include
in the deconvolution.
sample.varname: the name of the column in
sc_meta that contains the sample labels.
minCountGene: the minimum number of gene expression
counts present at a spatial location to include the location in the
analysis.
minCountSpot: the minimum number of spatial
locations where a gene must be expressed to include the gene in the
analysis.
alpha0: the hyperparameter for the zero data
transformation (see the model description in the manuscript). In
practice, we recommend a value near 2, as this centers the posterior
transformations for the zero data at 0.
alpha1: the hyperparameter for the non-zero data
transformation (see the model description in the manuscript). In
practice, we recommend a value near 0.5, as this reduces the size of the
right tail of the data distribution, but not so much as to overwhelm
signal.
n: the number of posterior samples of the cell-type
proportions to generate. We recommend setting n = 100, but for this
example we set n = 5 to speed up the computation.
seed: a random seed for reproducibility.
doWAIC: whether to compute the WAIC for model
selection. In practice, we recommend trying a few combinations of
alpha0 and alpha1 and using the WAIC to select
the best combination. For this example, we set
doWAIC = FALSE to speed up the computation.
The output of the HGTplusCARD function is a list
containing seven elements:
cell_type_proportion_matrices: a list of length
eight containing:
mean_ct_prop: a matrix of the posterior mean
cell-type proportions.
median_ct_prop: a matrix of the posterior median
cell-type proportions.
lower_95_prop: a matrix of the lower bound of the
95% pointwise Bayesian credible interval of the cell-type
proportions.
upper_95_prop: a matrix of the upper bound of the
95% pointwise Bayesian credible interval of the cell-type
proportions.
transformed_ct_prop_mat_list: a list of all
posterior samples of the cell-type proportions.
CARD_var: a matrix of the posterior variance from
CARD of the cell-type proportions.
HGT_var : a matrix of the posterior variance from
the ZI-HGT of the cell-type proportions.
total_var: a matrix of the total posterior variance
of the cell-type proportions.
WAIC: The WAIC (if WAIC = TRUE, otherwise NULL) for
the model fit with parameters alpha0 and
alpha1.
alpha_0: the value of the hyperparameter
alpha0 used in the ZI-HGT.
alpha_1: the value of the hyperparameter
alpha1 used in the ZI-HGT.
n: the number of posterior samples of the cell type
proportions.
original_seed: the random seed used to generate the
posterior samples.
time: the time taken to run the ZI-HGT and
CARD.
We then recommend visualizing the results with HGT.CARD.visualize.prop and HGT.CARD.two.tone, which wrap CARD’s visualization functions to generate plots of the cell-type proportions and a binary plot showing proportions greater than a cutoff, respectively. In the example below, we use a cutoff of 0.1, which, assuming approximately 10 cells per location, indicates at least one cell of that type at the locationn.
ct_props_plot <- HGT.CARD.visualize.prop(
proportion = ZI_HGT_CARD_obj$cell_type_proportion_matrices$mean_ct_prop,
spatial_location = spatial_location,
ct.visualize = unique(sc_meta$cell_type),
colors = c("mediumpurple4","white","green4"),
NumCols = 7,
pointSize = 7.0)
print(ct_props_plot)
ct_props_two_tone_plot <- HGT.CARD.two.tone(
proportion = ZI_HGT_CARD_obj$cell_type_proportion_matrices$mean_ct_prop,
spatial_location = spatial_location,
ct.visualize = unique(sc_meta$cell_type),
colors = c("grey70","green4"),
NumCols = 7,
pointSize = 7.0,
cutoff = 0.1)
print(ct_props_two_tone_plot)
All analysis in this tutorial was performed in R version 4.2.0 and requires the following packages: